Today we are going to talk about running some basic statistics in R.
library(tidyverse)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ─────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.2.1     ✓ purrr   0.3.3
✓ tibble  2.1.3     ✓ dplyr   0.8.3
✓ tidyr   1.0.2     ✓ stringr 1.4.0
✓ readr   1.3.1     ✓ forcats 0.4.0
── Conflicts ──────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(plotly)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
library(cowplot)
********************************************************
Note: As of version 1.0.0, cowplot does not change the
default ggplot2 theme anymore. To recover the previous
behavior, execute:
theme_set(theme_cowplot())
********************************************************
Usually, when I want to run stats on a thing, I start with some basic descriptive stats. Fortunately we did this last week with the starwars dataset, so let's jump back in.
data(starwars)
Lots of statistics are applied to continuous variables. Let's take a look at the relationship between height and mass. I'm going to make an interactive plot with ggplotly so we can mouse over points and see them better.
Here I included a name and spec variable, but didn't map them to anything. That's so you can see them when you mouse over the points.
hmPlot <- starwars %>% ggplot(aes(x = height, y = mass, name = name, spec = species)) + geom_point() + theme_cowplot()
ggplotly(hmPlot)
Whoa dude. One character is anomalously heavy. Mouse over the outlier so you can see who it is.
That was my first guess too.
Jabba is probably going to screw up all of our stats, so let's focus on the relationship among all of the other characters by removing him.
starwars01 <- starwars %>% filter(!str_detect(name, "Jabba")) # Complicated syntax because I can't spell the rest of his name
hmPlot01 <- starwars01 %>% ggplot(aes(x = height, y = mass, name = name, spec = species)) + geom_point() + theme_cowplot()
ggplotly(hmPlot01)
So here's our relationship. It looks sort of, but not really, linear, which makes sense: there are lots of species in the galaxy.
Sometimes when we do stats, the stats like to imagine normally distributed data.
starwars01 %>% pivot_longer(cols = mass:height, names_to = "measurement", values_to = "value") %>%
ggplot(aes(x = value)) + facet_wrap(measurement~.) + geom_histogram()
Hmm. Sort of.
See if you can figure out what I did above.
Let's see if height and mass are correlated.
cor.test(starwars01$height, starwars01$mass, method = "pearson")
Pearson's product-moment correlation
data: starwars01$height and starwars01$mass
t = 8.7853, df = 56, p-value = 4.018e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6260700 0.8520232
sample estimates:
cor
0.7612612
cor.test(starwars01$height, starwars01$mass, method = "spearman")
Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: starwars01$height and starwars01$mass
S = 6946.8, p-value = 2.597e-13
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.786313
Here are a parametric and a non-parametric correlation test. The values I usually think about are cor (Pearson), which is the R value, and rho (Spearman), which is a lot like an R value.
Numbers closer to 1 are more positively correlated, numbers closer to -1 are more negatively correlated, and numbers near 0 are uncorrelated.
There is a p-value for both of these, which is essentially the probability of seeing a correlation at least this strong if all of the assumptions of the test held and the true R or rho were zero.
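If you want to use these numbers downstream rather than just read them off the console, cor.test() returns a list you can index into. A small sketch, using the same call as above:

```r
# cor.test() returns an "htest" list; pull the pieces out directly
ct <- cor.test(starwars01$height, starwars01$mass, method = "pearson")
ct$estimate   # the R value (cor)
ct$p.value    # the p-value
ct$conf.int   # the 95 percent confidence interval
```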
Regressions are also pretty popular with the kids these days. Let's do one! Let's see how well height explains mass.
mod <- lm(mass ~ height, data = starwars01)
summary(mod)
Call:
lm(formula = mass ~ height, data = starwars01)
Residuals:
Min 1Q Median 3Q Max
-39.382 -8.212 0.211 3.846 57.327
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -32.54076 12.56053 -2.591 0.0122 *
height 0.62136 0.07073 8.785 4.02e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.14 on 56 degrees of freedom
(28 observations deleted due to missingness)
Multiple R-squared: 0.5795, Adjusted R-squared: 0.572
F-statistic: 77.18 on 1 and 56 DF, p-value: 4.018e-12
Let's look at what this tells us. Residuals are essentially the distance of each point from the model's prediction. The point farthest below the line is ~39 too low, the one farthest above is ~57 too high, and the median and the first and third quartiles are also shown.
Coefficients are often what you care about. The estimates tell us the slope and intercept of the linear regression. These values basically say our model looks like this:
mass = -32.5 + 0.62 * height
We don't know the true values of these parameters, so the standard errors give us an idea of the ranges of the two. The t value is the test statistic for whether each coefficient differs from zero, and the p-value tells you if that result is statistically significant.
4.02 x 10^-12 < 0.001, so the height coefficient gets three stars after it. I usually don't make too much of the p-value of the intercept.
Residual standard error is the standard deviation of the residuals, according to the internet. The 56 degrees of freedom is our sample size minus the complexity of the model: we have 58 characters, but we lose one DF for the intercept and another for the height variable that we are using.
There is also an R^2, an adjusted R^2 which gets smaller as the model gets more complicated, and a p-value for the whole model.
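As a quick sanity check, we can pull the coefficients out of the model object and apply the fitted equation by hand (the 180 cm height below is just an arbitrary example value, not from the data):

```r
# coef() returns the named vector of fitted coefficients
b <- coef(mod)
b

# Predicted mass for a hypothetical 180 cm character, straight from the equation
b[["(Intercept)"]] + b[["height"]] * 180

# The R-squared values live in the summary object
summary(mod)$r.squared
summary(mod)$adj.r.squared
```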
Linear models like to assume that the residuals are normally distributed, and we want to make sure that no single character is affecting our model too much. We can plot the model to check both.
plot(mod)
The broom package gives a nice data-frame version of model results.
library(broom)
tidy(mod)
Lets see how well our model’s predictions compare to the actual data.
We can use the predict function to generate some predictions
# Make a data frame containing all of the possible heights, from the shortest to tallest character
predPreDf <- tibble(
  height = min(na.omit(starwars01$height)):max(na.omit(starwars01$height))
)
# for each height, predict the mass
PredMass <- predict(mod, predPreDf) # predict requires a model, mod, and a data frame with the predictors in it. Our only predictor is height, so it's just a data frame with one column
# Stick the range of heights and predicted masses together into a data frame
predDf <- tibble(height = predPreDf$height, mass = PredMass)
# plot the original data
starwars01 %>% ggplot(aes(x = height, y = mass)) + geom_point() +
# plot the predicted values, notice that I include the predicted data in the "data" argument below
geom_path(aes(x = height, y = mass), data = predDf, color = "darkgreen")
There are a lot of species right now, but most of them only have a few characters. I'm interested in looking at things that differentiate humans from other species, so let's make a new column that indicates whether a character is human.
starwars02 <- starwars01 %>% mutate(isHuman = ifelse(species == "Human", "Yes", "No"))
Let's see if humanity is a reasonable predictor of a character's mass.
First, let's plot the two against each other.
ggplot(aes(mass, isHuman), data = starwars02) + geom_point()
OK, one character has a species that is NA. It looks like humans tend to be intermediate in mass, with a lot of overlap with the non-humans.
Let's ask whether humans tend to have different mass than non-humans. I'm betting not so much, since they seem to be right in the middle of the other organisms' masses.
modMH <- lm(mass ~ isHuman, data = starwars02)
summary(modMH)
Call:
lm(formula = mass ~ isHuman, data = starwars02)
Residuals:
Min 1Q Median 3Q Max
-56.834 -15.634 -3.782 10.166 87.166
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.834 4.909 14.634 <2e-16 ***
isHumanYes 10.948 7.901 1.386 0.171
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 29.04 on 55 degrees of freedom
(29 observations deleted due to missingness)
Multiple R-squared: 0.03373, Adjusted R-squared: 0.01616
F-statistic: 1.92 on 1 and 55 DF, p-value: 0.1715
So with categorical variables, we can interpret this much the same way as the last model. We have a model that looks something like:
mass = 72 + 11 * isHumanYes
This essentially says that if the character is not human, we treat isHumanYes as 0, and if the character is human, we treat it as 1. So the predicted average mass of a human is:
72 + 11 * 1 = 83
And for non-humans, 72 + 11 * 0 = 72.
That said, our p-value for the isHumanYes term is 0.171, which is > 0.05, which means that even if there were no real difference, it's pretty likely that we could have gotten at least this strong a result.
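One way to check this interpretation: with a single categorical predictor, the intercept should equal the raw mean mass of the non-humans, and intercept + slope should equal the mean for humans. A quick sketch:

```r
# Group means should line up with the model's predictions (~72 and ~83)
starwars02 %>%
  filter(!is.na(mass), !is.na(isHuman)) %>%
  group_by(isHuman) %>%
  summarize(meanMass = mean(mass))
```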
We can make our model even more complicated by adding in gender.
plotHumanSpec <- ggplot(aes(mass, isHuman, shape = gender, name = name, species = species), data = starwars02) + geom_point(size = 2)
ggplotly(plotHumanSpec)
Among other things, it becomes clear that most characters in Star Wars are male.
starwars02 %>% group_by(isHuman, gender) %>% summarize(n = n())
One character has no gender and a few have NA for gender, and in the Star Wars universe that is the limit of characters' gender types. We won't have the statistical power to look at more than male and female characters here, so I'm going to limit the analysis to male and female, human and non-human.
starwars03 <- starwars02 %>% filter(gender %in% c("male", "female"), isHuman %in% c("Yes", "No"))
modMHG <- lm(mass ~ isHuman + gender, data = starwars03)
summary(modMHG)
Call:
lm(formula = mass ~ isHuman + gender, data = starwars03)
Residuals:
Min 1Q Median 3Q Max
-62.021 -9.246 -1.614 5.386 81.979
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.614 8.993 5.739 5.55e-07 ***
isHumanYes 9.226 7.264 1.270 0.2099
gendermale 25.407 9.532 2.665 0.0103 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 25.98 on 50 degrees of freedom
(23 observations deleted due to missingness)
Multiple R-squared: 0.1565, Adjusted R-squared: 0.1228
F-statistic: 4.639 on 2 and 50 DF, p-value: 0.01419
So here we try a model with both isHuman and gender. You would interpret this the same way. So a human female would have a predicted mass of:
mass = 52 + (9 * 1) + (25 * 0)
Our p-value is above 0.05 for isHumanYes again though, so that variable doesn't seem to be a statistically significant predictor.
Let's add an interaction term. This time I'll predict mass from gender, height, and their interaction.
modMHG2 <- lm(mass ~ gender * height , data = starwars03)
summary(modMHG2)
Call:
lm(formula = mass ~ gender * height, data = starwars03)
Residuals:
Min 1Q Median 3Q Max
-42.042 -6.435 -1.431 0.933 55.141
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.37255 100.63238 0.471 0.640
gendermale -73.26685 101.40535 -0.723 0.473
height 0.04321 0.59338 0.073 0.942
gendermale:height 0.55750 0.59735 0.933 0.355
Residual standard error: 16.68 on 49 degrees of freedom
(23 observations deleted due to missingness)
Multiple R-squared: 0.6594, Adjusted R-squared: 0.6386
F-statistic: 31.62 on 3 and 49 DF, p-value: 1.612e-11
So here, we predict mass with a continuous variable (height), a discrete variable (gender: male = 1, otherwise 0), and the interaction between the two. Based on the p-values of the individual terms, none of them looks useful, yet the model-wide p-value of 1.61e-11 is really low. How do we interpret that? The predictors are strongly correlated with each other (height and gendermale:height carry largely the same information), so the model as a whole predicts mass well even though the credit can't be cleanly assigned to any single term.
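One way (not shown above) to ask whether the interaction itself earns its keep is to compare nested models with anova(). This sketch assumes a hypothetical additive model, modAdd, fit on the same data:

```r
# Fit the additive model, then test whether adding the gender:height
# interaction significantly reduces the residual sum of squares
modAdd <- lm(mass ~ gender + height, data = starwars03)
anova(modAdd, modMHG2)
```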
What if we have a binary variable as our y value, rather than as an x value? Then we want to use logistic regression.
starwars04 <- starwars03 %>% mutate(isHumanNum = ifelse(isHuman == "Yes", 1, 0))
modBinom <- glm(isHumanNum ~ mass, family = binomial, data = starwars04)
summary(modBinom)
Call:
glm(formula = isHumanNum ~ mass, family = binomial, data = starwars04)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5577 -1.0515 -0.8546 1.3087 1.5391
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.48217 0.89174 -1.662 0.0965 .
mass 0.01473 0.01085 1.358 0.1744
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 71.938 on 52 degrees of freedom
Residual deviance: 69.959 on 51 degrees of freedom
(23 observations deleted due to missingness)
AIC: 73.959
Number of Fisher Scoring iterations: 4
So in the above example, we are predicting the probability that a character is human from its mass. It's not a statistically significant model, but let's plot the predicted probability of being human as a function of mass anyway.
massRange <- min(na.omit(starwars04$mass)):max(na.omit(starwars04$mass))
humanPred <- predict(modBinom, newdata = tibble(mass = massRange), type = "response") # for glm it's called newdata, not data
predDfMG <- tibble(mass = massRange, isHumanNum = humanPred)
predDfMG
ggplot(aes(x = mass, y = isHumanNum), data = starwars04) + geom_point() +
geom_path(data = predDfMG)
So here, the line indicates the probability, based on our model, that a character is human, based on their mass.
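Under the hood, the glm coefficients are on the log-odds scale, and plogis() converts log-odds back to probabilities. Here's a sketch for one arbitrary mass value (80 is just an example, not a number from the text):

```r
# P(human) at mass 80, computed by hand from the coefficients.
# This is what predict(..., type = "response") does internally.
b <- coef(modBinom)
plogis(b[["(Intercept)"]] + b[["mass"]] * 80)
```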
I notice, though, that humans tend to have intermediate mass. Let's use a polynomial regression to address this. To do this, I include a squared term in the model.
modBinom2 <- glm(isHumanNum ~ mass + I(mass^2), family = binomial, data = starwars04)
summary(modBinom2)
Call:
glm(formula = isHumanNum ~ mass + I(mass^2), family = binomial,
data = starwars04)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3249 -1.1479 -0.4838 1.1410 1.9263
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.5908502 3.0668834 -2.149 0.0316 *
mass 0.1414293 0.0692430 2.043 0.0411 *
I(mass^2) -0.0007204 0.0003791 -1.900 0.0574 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 71.938 on 52 degrees of freedom
Residual deviance: 64.331 on 50 degrees of freedom
(23 observations deleted due to missingness)
AIC: 70.331
Number of Fisher Scoring iterations: 5
Hey, this looks OK!
Let's plot it!
massRange <- min(na.omit(starwars04$mass)):max(na.omit(starwars04$mass))
humanPred <- predict(modBinom2, newdata = tibble(mass = massRange), type = "response") # for glm it's called newdata, not data
predDfMG <- tibble(mass = massRange, isHumanNum = humanPred)
predDfMG
ggplot(aes(x = mass, y = isHumanNum), data = starwars04) + geom_point() +
geom_path(data = predDfMG)
This actually makes decent sense. At intermediate masses, characters are around 50% likely to be human. At large and small masses, they are a lot less likely to be human.
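A side note: on the log-odds scale, the quadratic fit is a downward-opening parabola, so the most human-like mass sits at its vertex, -b1 / (2 * b2). A sketch:

```r
# Mass at which the predicted probability of being human peaks
b <- coef(modBinom2)
-b[["mass"]] / (2 * b[["I(mass^2)"]])
```

With the coefficients above, this should land right at the peak of the plotted curve.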